Comparing Distributed Indexing: To MapReduce or Not?
Authors
Abstract
Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular under the auspices of the MapReduce programming framework. Specifically, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.
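For readers unfamiliar with the idea, the following is a minimal in-memory sketch of inverted-index construction in the map/reduce style, in the spirit of the example in the original MapReduce paper (the map step emits term and document-identifier pairs, the reduce step collects them into a posting list). It is not the authors' implementation and does not use Hadoop: the function names `map_document`, `reduce_term`, and `build_index`, the whitespace tokenizer, and the toy corpus are all assumptions made for illustration, and the "shuffle" is simulated with a dictionary.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step (hypothetical helper): emit one (term, doc_id) pair per token."""
    for token in text.lower().split():  # naive whitespace tokenizer, an assumption
        yield token, doc_id


def reduce_term(term, doc_ids):
    """Reduce step: turn the doc ids seen for one term into a sorted posting list
    of (doc_id, term_frequency) pairs."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, sorted(counts.items())


def build_index(corpus):
    """Simulate map, shuffle (group by term), and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_term(term, ids) for term, ids in sorted(grouped.items()))


# Toy corpus, purely illustrative.
corpus = {1: "to map or not to map", 2: "map reduce"}
index = build_index(corpus)
```

In a real MapReduce job the grouping by term would be done by the framework between the map and reduce phases; the dictionary here only stands in for that step.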
Similar resources
Performance Optimization of a Distributed Transcoding System based on Hadoop for Multimedia Streaming Services
Recently, Hadoop, which is based on the MapReduce model, has gained considerable attention because its data preprocessing techniques are not time-consuming and are suitable for processing large-scale data. In particular, MapReduce is emerging as an important programming model for developing distributed data-processing applications such as web indexing, data mining, log file analysis, ...
A Survey on MapReduce Performance and Hadoop Acceleration
MapReduce is an implementation for generating large data sets with a parallel, distributed algorithm on a cluster. Hadoop is an open-source implementation of the MapReduce programming data model used for large-scale parallel applications such as web indexing, data mining, and scientific simulation. The Hadoop-A framework is able to accelerate Hadoop and give a significant performance gain compared to ...
Research on Multi-Tenant Distributed Indexing for SaaS Application
Multi-tenancy is a key feature of SaaS applications; however, the traditional indexing mechanism fails in a multi-tenant shared-schema database. This paper proposes a multi-tenant distributed indexing mechanism. We first create a global index and then create the local index using the MapReduce framework on Hadoop. We also propose processes for index update and index merging. Experimental r...
SciPDFindexer: Distributed Information Retrieval system using MapReduce
Indexing converts raw document collections into easily searchable formats. Larger-scale indexing poses challenges in efficiently distributing the indexing computation across a cluster of nodes. The MapReduce framework promises to be an effective tool for parallelizing tasks such as inverted index construction. We propose SciPDFindexer, a distributed information retrieval syste...
Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search
This paper describes Ivory, an attempt to build a distributed retrieval system around the open-source Hadoop implementation of MapReduce. We focus on three noteworthy aspects of our work: a retrieval architecture built directly on the Hadoop Distributed File System (HDFS), a scalable MapReduce algorithm for inverted indexing, and webpage classification to enhance retrieval effectiveness.
Journal title:
Volume, issue:
Pages: -
Publication date: 2009